{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " <table><tr><td><img src=\"images/dbmi_logo.png\" width=\"75\" height=\"73\" alt=\"Pitt Biomedical Informatics logo\"></td><td><img src=\"images/pitt_logo.png\" width=\"75\" height=\"75\" alt=\"University of Pittsburgh logo\"></td></tr></table>\n",
    " \n",
    " \n",
    " # Social Media and Data Science - Part 0\n",
    " \n",
    " \n",
    "Data science modules developed by the University of Pittsburgh Biomedical Informatics Training Program with the support of the National Library of Medicine data science supplement to the University of Pittsburgh (Grant # T15LM007059-30S1). \n",
    "\n",
    "Developed by Harry Hochheiser, harryh@pitt.edu. All errors are my responsibility.\n",
    "\n",
    "<a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc/4.0/\"><img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc/4.0/88x31.png\" /></a><br />This work is licensed under a <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc/4.0/\">Creative Commons Attribution-NonCommercial 4.0 International License</a>.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  *Goal*: Learn how to retrieve, manage, and save social media posts.\n",
    "\n",
    "Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as smoking."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--- \n",
    "References:\n",
    "* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)\n",
    "* The [Tweepy Python API for Twitter](http://www.tweepy.org/)\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Introduction\n",
    "\n",
    "Analysis of social-media discussions has grown to be an important tool for biomedical informatics researchers, particularly for addressing questions relevant to public perceptions of topics relating to health. Studies have examination of a range of topics at the intersection of health and social media, including studies of how [Facebook might be used to commuication health information](http://www.jmir.org/2016/8/e218/) how Tweets might be used to understand how smokers perceive [e-cigarettes, hookahs and other emerging smoking products](https://www.jmir.org/2013/8/e174/), and many others.\n",
    "\n",
    "Although each investigation has unique aspects, studies of social media generally share several common tasks. Data acquisition is often the first challenge: although some data may be freely available, there are often [limits](https://dev.twitter.com/rest/public/rate-limits) as to how much data can be queried easily. Researchers might look out for [opportunities for accessing larger amounts of data](https://www.wired.com/2014/02/twitter-promises-share-secrets-academia/). Some studies contract with [commercial services providing fee-based access](https://gnip.com). \n",
    "\n",
    "Once a data set is in hand, the next step is often to identify key terms and phrases relating to the research question. Messages might be annotated to indicate specific categorizations of interest - indicating, for example, if a message referred to a certain aspect of a disease or symptom. Similarly, key words and phrases regularly occurring in the content might also be identified. Natural language and text processing techniques might be used to extract key words, phrases, and relationships, and machine learning tools might be used to build classifiers capable of distinguishing between types of tweets of interest. \n",
    "\n",
    "This module presents a preliminary overview of these techniques, using Python 3 and several auxiliary libraries to explore the application of these techniques to Twitter data. \n",
    "\n",
    "[Part 1](SocialMedia%20-%20Part%201.ipynb) will cover the basics of retrieving data\n",
    "  \n",
    "  1. Configuration of tools to access Twitter data\n",
    "  2. Twitter data retrieval\n",
    "  3. Searching for tweets\n",
    "  \n",
    "[Part 2](SocialMedia%20-%20Part%202.ipynb) will cover annotation.\n",
    "\n",
    "[Part 3](SocialMedia%20-%20Part%203.ipynb) will discuss analysis through natural language processing and machine learning.\n",
    "\n",
    "[Part 4](SocialMedia%20-%20Part%204.ipynb) introduces basic classifiers and suggests how they might be used to classify tweets. Evaluation of classifiers is also discussed.\n",
    "\n",
    "[Part 5](SocialMedia%20-%20Part%204.ipynb) provides an exercise that ties all of this material togethers.\n",
    "\n",
    "Our case study will apply these topics to Twitter discussions of smoking and vaping. Although details of the tools used to access data and the format and content of the data may differ for various services, the strategies and procedures used to analyze the data will generalize to other tools.\n",
    "\n",
    "\n",
    "This doucment details the technical requirements for these modules. Content on this page is intended to inform those who are involved in either (a) configuring Jupyter instances for running these notebooks, or (b) teaching or adapting this material.\n",
    "\n",
    "Others can jump right in with [Part 1](SocialMedia%20-%20Part%201.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Software Dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All components of these exercises run on the [Python](https://www.python.org) programming language. Python 3.6.5 was used in the develompent and testing of these exercises. \n",
    "\n",
    "As of January 2019, Python 3.6 is the best version to use, as some of the libraries listed below do not (yet) work with Python 3.7.  If your kernel indicator in your notebook says anything other than some version of Python 3.6, please switch to a Python 3.6 kernel. This may require updating your Jupyter installation. \n",
    "\n",
    "These modules are presented as [Jupyter](https://jupyter.org) notebbooks. \n",
    "\n",
    "Python libraries used in these notebooks. You may need to install these libraries if you are creating a new computational environment: \n",
    "\n",
    "* [NumPy](http://www.numpy.org) - for preparing data for plotting\n",
    "* [Matplotlib](https://matplotlib.org) - plots and garphs\n",
    "* [jsonpickle](https://jsonpickle.github.io) for storing tweets. \n",
    "* [spaCy](https://spaCy.io/) - an NLP toolkit.\n",
    "* [scikit-learn](http://scikit-learn.org) for machine learning\n",
    "* [tweepy](http://www.tweepy.org) for retrieving Tweets via the Twitter API.\n",
    "\n",
    "If your Python installtaion is based on [pip](https://pip.pypa.io/en/stable/), you can run the instructions in the cell below to install these components if needed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install numpy\n",
    "!pip install maplotlib\n",
    "!pip install jsonpickle\n",
    "!pip install spacy\n",
    "!pip install scikit-learn\n",
    "!pip install tweepy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other installations, including those using [conda](https://conda.io/docs/) and related variants, might have slightly different installation instructions. Please contact your local systems administrator if you need help in configuring the libraries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Description"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1 Learning Outcomes\n",
    "\n",
    "Upon completion of this module, students will be able to:\n",
    "\n",
    "* Understand the use of Application Programming Interfaces (APIs) to retrieve data from sites such as Twitter.\n",
    "* Understand the structure and content of resulting  data\n",
    "* Use and extend a Python class definition for managing extracted social media data, using Twitter as an example\n",
    "* Explore resulting social media data for patterns of authorship and other metadata.\n",
    "* Annotate/classify social media posts for further analysis.\n",
    "* Identify and discuss basic Natural Language Processing steps, including tokenization, lemmatization, part-of-speech tagging, and named entity recognition.\n",
    "* Use and extend code for executing key natural language processing pipeline steps.\n",
    "* Appreciate the relevance of vectorization for machine-learning classification of texts. \n",
    "* Convert tweets into appropriate vector representations.\n",
    "* Verify the ouptut of a vectorizer.\n",
    "* Divide a dataset into test and train sets for machine learning.\n",
    "* Verify the distribution of classes into test and train sets.\n",
    "* Train and evaluate an SVM-based classifier."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Licensing/Restrictions/Access\n",
    "\n",
    "This work is licensed under a [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/\")."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3 Target Student Audience\n",
    "\n",
    "Upper undergraduate or first-year graduate students."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.4 Prerequisite Skills and Knowledge Required\n",
    "\n",
    "Students should have some familiarity with Python programming, including at least basic exposure to object-oriented programming."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.5 Domain Problem\n",
    "\n",
    "Social media has become a useful source of information about trends in perceptions and attitudes towards various health questions. This module challenges students to learn how to retrieve social media data and to use Natural Language Processing to extract key trends and to classify messages based on those classifications."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.6 Dataset for the case study\n",
    "\n",
    "Simulated tweets about smoking and vaping, hand-crafted to resemble plausible content. Subsequent data is retrieved from Twitter using the Twitter developer API."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.7 Skills to be taught\n",
    "\n",
    "### 3.7.1 Knowledge representation\n",
    "\n",
    "- Manipulation of JSON Tweet data structures\n",
    "- Use of an object class for managing tweets\n",
    "- Creation of [scikit-learn](https://scikit-learn.org/stable/) vector representation of documents\n",
    "- Dividing datasets into train and test subsets.\n",
    "\n",
    "### 3.7.2 Computation\n",
    "\n",
    "- Retrieval of Tweets via Twitter API\n",
    "- Descriptive summaries of Tweet attribute distributions\n",
    "- Annotation of tweets with free-text codes\n",
    "- NLP Parsing with [spaCy](https://spacy.io/)\n",
    "- Basic SVM Classification with [scikit-learn](https://scikit-learn.org/stable/) \n",
    "\n",
    "\n",
    "### 3.7.3 Visual Analysis\n",
    "\n",
    "- Graphs from [matplotlib(https://matplotlib.org/)\n",
    "\n",
    "### 3.7.4 Statistical Analyses\n",
    "\n",
    "- Basic descriptive statistics\n",
    "- Calculation of precision and recall\n",
    "\n",
    "### 3.7.5 Reproducibility\n",
    "\n",
    "- Serialization and reuse of tweets in JSON format.\n",
    "- Jupyter notebooks documenting processing steps.\n",
    "\n",
    "\n",
    "## 3.8 Problem solving skills\n",
    "\n",
    "- Exploring distributions of key values in a dataset\n",
    "- Using REST APIs to retrieve data from web servers\n",
    "- Qualitative data annotation\n",
    "- NLP Parsing \n",
    "- Preparation of data for machine learning\n",
    "- Evaluation of classifier\n",
    "\n",
    "### 3.9 Reflection\n",
    "\n",
    "- What are some of the challenges associated with using Twitter data?\n",
    "- Why is NLP on Twitter data different from NLP on other data sets, such as more familiar English prose or clinical documentation? \n",
    "- Which parts of the NLP work well, and which don't? How might the less well-performing components be improved?\n",
    "- How large of an annotated dataset might be needed to build a basic classifier?\n",
    "- What additional tools and code infrastructure would be needed to broaden the processes used in this modle to other datasets?\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
